fast_create_table: add retry for local worker #52543
Conversation
pkg/ddl/id_allocator.go
Outdated

```go
	"github.com/pingcap/tidb/pkg/meta"
)

const (
```
Can you add some comments to make the function and parameter usage easier for users to understand?
/retest
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff              @@
##             master     #52543        +/- ##
=================================================
- Coverage   74.7101%   56.2759%   -18.4343%
=================================================
  Files          1545       1669        +124
  Lines        362047     612862     +250815
=================================================
+ Hits         270486     344894      +74408
- Misses        71923     244527     +172604
- Partials      19638      23441       +3803
```

Flags with carried forward coverage won't be shown.
```diff
@@ -348,7 +348,7 @@ func TestProcessChunkWith(t *testing.T) {
 	require.Len(t, progress.GetColSize(), 3)
 	checksumMap := checksum.GetInnerChecksums()
 	require.Len(t, checksumMap, 1)
-	require.Equal(t, verify.MakeKVChecksum(111, 3, 14231358899564314836), *checksumMap[verify.DataKVGroupID])
+	require.Equal(t, verify.MakeKVChecksum(111, 3, 4697906519963456140), *checksumMap[verify.DataKVGroupID])
```
The checksum changed due to a different table ID.
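To illustrate (a toy sketch, not the logic of the actual verify package): row keys embed the table ID, so encoding the same rows under a new table ID yields a different checksum, which is why the expected value in this test changed.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc64"
)

// rowKey builds a toy key that, like TiDB's row keys, embeds the table ID.
func rowKey(tableID, rowID int64) []byte {
	b := make([]byte, 16)
	binary.BigEndian.PutUint64(b[:8], uint64(tableID))
	binary.BigEndian.PutUint64(b[8:], uint64(rowID))
	return b
}

// checksum folds the per-key CRCs together, loosely mimicking a KV checksum.
func checksum(tableID int64) uint64 {
	table := crc64.MakeTable(crc64.ECMA)
	var sum uint64
	for row := int64(1); row <= 3; row++ {
		sum ^= crc64.Checksum(rowKey(tableID, row), table)
	}
	return sum
}

func main() {
	// Same rows, different table IDs: the checksums differ.
	fmt.Println(checksum(100) == checksum(101)) // false
}
```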
/approve
pkg/ddl/ddl_worker.go
Outdated

```go
		tasks = newTasks
	}
	// TODO: since ticdc doesn't support batch create tables now, we disable until ticdc support it.
	// if newTasks, err := d.combineBatchCreateTableJobs(tasks); err == nil {
```
If they can support it in the short term, feel free to leave it; if it won't be supported for a long time, can we just delete it first?
They said it has not been scheduled yet; we hope to support CDC in 8.1, so we disable it first.
```go
	var err error
	for i := 0; i < maxRetryTime; i++ {
		err = wk.HandleLocalDDLJob(d.ddlCtx, job)
		if err == nil || !isRetryableError(err) {
```
I remember that if the job itself is not finished (`job.IsFinished` is false), it will retry by itself. Why is special handling needed here?
For fast create table, the job does not exist in the DDL job table, so the original retry mechanism (where the job is taken out of the DDL job table and re-executed) will not take effect.
Got it, could we add a comment here?
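For readers following along, here is a minimal, self-contained sketch of the in-process retry pattern under discussion; `handleLocalDDLJob`, `isRetryableError`, `maxRetryTime`, and `retryInterval` are stand-ins for the PR's actual identifiers.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxRetryTime  = 3                     // stand-in for the PR's retry bound
	retryInterval = 50 * time.Millisecond // stand-in backoff between attempts
)

var errWriteConflict = errors.New("write conflict")

// isRetryableError reports whether an error is transient and worth
// retrying in-process (e.g. a write conflict), as opposed to a hard
// failure that should be returned to the client immediately.
func isRetryableError(err error) bool {
	return errors.Is(err, errWriteConflict)
}

// handleLocalDDLJob stands in for the worker's HandleLocalDDLJob; it
// fails with a retryable error on the first two attempts.
func handleLocalDDLJob(attempt int) error {
	if attempt < 2 {
		return errWriteConflict
	}
	return nil
}

func main() {
	var err error
	for i := 0; i < maxRetryTime; i++ {
		err = handleLocalDDLJob(i)
		// Stop on success or on a non-retryable error. A local
		// (fast-create-table) job is never persisted to the DDL job
		// table, so no background loop will re-execute it for us;
		// this in-process loop is the only retry it gets.
		if err == nil || !isRetryableError(err) {
			break
		}
		time.Sleep(retryInterval)
	}
	fmt.Println("final error:", err)
}
```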
pkg/ddl/id_allocator.go
Outdated

```go
}

type allocator struct {
	cache []int64
```
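As context for the exchange below, a rough, self-contained sketch of how a cache-based allocator of this shape might behave; the step size, helper name, and in-memory counter are illustrative stand-ins, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// globalCounter stands in for the persistent global ID key in TiKV.
var globalCounter int64

// fetchAndBumpGlobalID reserves `step` IDs by bumping the shared counter
// once; in TiDB this would be one (potentially conflicting) transaction.
func fetchAndBumpGlobalID(step int64) int64 {
	return atomic.AddInt64(&globalCounter, step) - step
}

// allocator hands out IDs from a local cache, refilling in batches so
// that most allocations never touch the shared counter. Because each
// instance reserves a whole range up front, IDs handed out by different
// instances can interleave and leave gaps.
type allocator struct {
	mu    sync.Mutex
	cache []int64
	step  int64
}

func (a *allocator) alloc() int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.cache) == 0 {
		base := fetchAndBumpGlobalID(a.step)
		for i := int64(1); i <= a.step; i++ {
			a.cache = append(a.cache, base+i)
		}
	}
	id := a.cache[0]
	a.cache = a.cache[1:]
	return id
}

func main() {
	a := &allocator{step: 5}
	b := &allocator{step: 5} // a second "instance" sharing the counter
	fmt.Println(a.alloc(), b.alloc(), a.alloc()) // 1 6 2: interleaved ranges
}
```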
With the cache, if there are multiple TiDB instances in a cluster, is the following possible?

- TiDB1 executes DDL1, DDL2, ..., DDL201 on table tblx.
- TiDB2 executes DDL302 (job.ID: 302) on table tbl.
- After 5 minutes, DDL201 is still not finished.
- TiDB1 executes DDLn (job.ID: 202) on table tbl.

Then DDLn starts executing 5 minutes later than DDL302, but actually starts and finishes earlier. There's no correctness problem, but it's weird. In addition, this ID is also used as the table ID, which may have similar problems.
Yes, that's possible; see id_allocator_test.go. If multiple TiDB instances execute DDL at the same time, the order of job IDs is no longer guaranteed, and there may be ID gaps between two jobs.
However, the previous implementation also had similar problems. For example, a job could take a long time between requesting its ID and committing, so the order of IDs was not guaranteed, and that had no impact.
The original ID order is not guaranteed either, but the gaps are relatively small. It doesn't seem to be a problem at the moment, but it feels kind of weird, and the table ID has the same problem. Do we have a performance comparison for this optimization, to see whether the effect is significant?
Performance is one aspect, but robustness matters more. When multiple TiDB instances allocate IDs at the same time, it causes a lot of write conflicts, and switching to pessimistic transactions would also hurt performance. See #27197.
- If you change the step to 10, the cases I mentioned above become rarer.
- The probability of the scenario you mentioned (#27197) is relatively small. In addition, we theoretically have limitDDLJobs, so this kind of problem should not occur within a single TiDB instance. But we use the same key for the tableID and the jobID, so a create-table operation may result in multiple ID allocations.

Suggestion: if we allocate the tableID and jobID together, this conflict will be reduced a lot, and it does not affect the old logic (which assumes ID allocation is roughly incremental).
In fact, we already have a global ID lock in the original logic, which means there are no longer write conflicts within a single TiDB instance; across multiple instances, though, the problem cannot be avoided.
`txn.SetOption(kv.Pessimistic, true)` alone doesn't give us a pessimistic transaction; we have to lock the required key explicitly. I tested lock + incr(global id) with 1000 goroutines, and QPS is about 1k.

> Suggestion: If we put tableID and jobID assignments together, this conflict will be reduced a lot

Agree with this; we can consider caching IDs after that optimization, if this part is still worth optimizing.

FYI, I have an ongoing PR that separates local DDL jobs from normal ones in addBatchDDLJobs, and changes job ID allocation + insertion into the job table to run in one txn, to make sure jobs are submitted in ID order.
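A tiny sketch of the property that last point is after (a hypothetical shape, not the actual PR): if the job ID is allocated and the job row inserted under the same atomic step, rows always land in the job table in ID order.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// jobStore simulates the DDL job table plus the global ID counter.
type jobStore struct {
	mu      sync.Mutex
	nextID  int64
	jobRows []int64 // job IDs in insertion (commit) order
}

// submitJob allocates the job ID and inserts the row in one critical
// section -- the analogue of running allocation + insert in one txn --
// so insertion order always matches ID order.
func (s *jobStore) submitJob() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nextID++
	s.jobRows = append(s.jobRows, s.nextID)
	return s.nextID
}

func main() {
	s := &jobStore{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); s.submitJob() }()
	}
	wg.Wait()
	// Holds regardless of goroutine scheduling.
	fmt.Println(sort.SliceIsSorted(s.jobRows, func(i, j int) bool {
		return s.jobRows[i] < s.jobRows[j]
	}))
}
```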
/retest

/retest

/retest
pkg/ddl/ddl_worker.go
Outdated

```go
	// lock to reduce conflict
	d.globalIDLock.Lock()

	ids, err = d.genGlobalIDs(len(tasks))
```
It seems this optimization only works when we connect to multiple nodes to do create-table. Do you have any data on how much this can improve performance?
/retest
Force-pushed from f8af8a5 to 979cb87.
/retest

/retest
/ok-to-test
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: D3Hunter, lance6716, tangenta.
/retest
/retest |
/retest |
/retest |
What problem does this PR solve?
Issue Number: ref #49752
Problem Summary:
What changed and how does it work?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.